@hemidactylus
Collaborator

@hemidactylus hemidactylus commented Feb 13, 2025

Clean separation of codecs/vectorstore layers

This PR introduces a better separation of knowledge on what pertains to codecs and what constitutes the underlying logic of the Vector Store. As such,

fixes #106

Meanwhile, it provides a couple of useful querying tools that belong to the vstore+codec layer. Another such tool, the "run_query" method (possibly the most important of these), is postponed to a follow-up PR.

Additionally, this restructuring of the code is a step toward a possible extension to cover API Tables without duplicating logic.

More in detail:

  • creation of "id queries" in the codec consistently (away from vectorstore)
  • moved id- and $vector-related parts of codec into the base class (not expected to vary under the data api)
  • codecs expose their "default indexing policy", vectorstore uses that knowledge in and around its constructor (for collection creation in particular)
  • codecs expose their "abstract metadata key to actual dot-notation field identifier" for building include/exclude policies specified by metadata fields
  • vectorstore does not say "_id" literally anymore (all is in codec; uses get_id)
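
As a rough illustration of the last three bullets, here is a minimal sketch of how a codec could expose its default indexing policy, its metadata-key mapping, and get_id. All class, method and function names below (NestedCodecSketch, FlatCodecSketch, default_indexing_policy, metadata_key_to_field_identifier, indexing_policy_for) and the policy values themselves are assumptions made for illustration, not necessarily what the PR implements.

```python
from typing import Any, Optional


class NestedCodecSketch:
    """Hypothetical codec that stores metadata under a 'metadata' sub-document."""

    def __init__(self, content_field: str) -> None:
        self.content_field = content_field

    @property
    def default_indexing_policy(self) -> dict[str, list[str]]:
        # Assumed default: index only the metadata sub-document.
        return {"allow": ["metadata"]}

    def metadata_key_to_field_identifier(self, key: str) -> str:
        # Abstract metadata key -> dot-notation field in the stored document.
        return f"metadata.{key}"

    def get_id(self, astra_document: dict[str, Any]) -> str:
        # The only place that spells out the stored ID field name.
        return astra_document["_id"]


class FlatCodecSketch(NestedCodecSketch):
    """Hypothetical codec that stores metadata as top-level fields."""

    @property
    def default_indexing_policy(self) -> dict[str, list[str]]:
        # Assumed default: index everything except the text content.
        return {"deny": [self.content_field]}

    def metadata_key_to_field_identifier(self, key: str) -> str:
        return key


def indexing_policy_for(
    codec: NestedCodecSketch, include: Optional[list[str]] = None
) -> dict[str, list[str]]:
    """Build an indexing policy without the caller knowing the document layout."""
    if include is None:
        return codec.default_indexing_policy
    return {"allow": [codec.metadata_key_to_field_identifier(k) for k in include]}
```

With this shape, the vector store only ever asks the codec for field names, so swapping the codec changes the resulting policy without touching the store's logic.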

A note on the name chosen for the get_id codec method. There is the unfortunate fact that LangChain calls "documents" its internal format, and Astra DB calls "document" what it stores. So ... codecs that lie between these two sometimes have a hard time with naming. Keeping it simple (get_id) should work in this case because only one of the two ends of the codec (the Astra side) can have a variable schema: only on that side, then, should the need arise for a function abstracting the reading of an ID. (Which btw is more of a formality, as the "_id" field is one of those that will hardly ever change in Astra DB!)

@hemidactylus hemidactylus requested a review from epinzur February 13, 2025 16:54
@hemidactylus hemidactylus changed the title from "Sl vs full delegate to codec" to "Vector Store, full separation codec / vectorstore" Feb 14, 2025
"""
return _default_encode_id(filter_id)

def encode_ids(self, filter_ids: list[str]) -> dict[str, Any]:

Two potential issues with this being a separate method:

  1. For a user that has a query for "these IDs AND this predicate" they're going to need to write the _id filter themselves unless they have access to the codec.
  2. We'd also need the rewriting that happens to special case the _id filter and not rewrite it to metadata._id.

I wonder if we should just adopt a $id or _id as the standard field for the id. Then, I don't know if we'd need the encode_ids -- we could just make the rewriter do the right thing in that case, and the user can provide { "$id": <id> } or { "$id": { "$in": [<ids>] }} depending on their needs in arbitrary places within the query.
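
To make the suggestion concrete, here is a minimal sketch of a filter rewriter that special-cases a "$id" key. The function name rewrite_filter, the ID_KEY constant and the "metadata." default prefix are assumptions for illustration only; this is not the rewriter that exists in the code base.

```python
from typing import Any

ID_KEY = "$id"  # hypothetical marker meaning "the document ID" in an abstract filter


def rewrite_filter(filter_dict: dict[str, Any], prefix: str = "metadata.") -> dict[str, Any]:
    """Translate an abstract filter into a filter on the stored document (sketch)."""
    rewritten: dict[str, Any] = {}
    for key, value in filter_dict.items():
        if key == ID_KEY:
            # Special case: the ID must never be rewritten to a metadata field.
            rewritten["_id"] = value
        elif key.startswith("$"):
            # Logical operator: recurse into its sub-filter(s), keep the operator.
            if isinstance(value, list):
                rewritten[key] = [rewrite_filter(clause, prefix) for clause in value]
            elif isinstance(value, dict):
                rewritten[key] = rewrite_filter(value, prefix)
            else:
                rewritten[key] = value
        else:
            # Plain metadata key: apply the prefix, leave the condition untouched.
            rewritten[f"{prefix}{key}"] = value
    return rewritten
```

Because the special case lives inside the recursion, an ID condition can appear at any depth of the query, which addresses both issues listed above.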

doc_id = self.document_codec.get_id(document)
return await _async_collection.replace_one(
{"_id": document["_id"]},
self.document_codec.encode_id(doc_id),

The idea of allowing the codec to just be a filter rewriter would actually be pretty useful here as well. It would allow these to just be "do a query with codec.rewrite_filters({"$id": doc_id})" (I don't remember the name it currently has), which feels clearer about what is being done. Then, instead of having a bunch of methods on the codec that are used for creating queries, there is just the one that converts a "standard" query (basically on the Document) into an appropriately encoded query.

@hemidactylus hemidactylus added the do_not_merge label ("Do not merge yet, requires further discussion") Feb 17, 2025
@hemidactylus
Collaborator Author

@bjchambers Thank you for taking the time to review this. You make good points.
And you are right, a refinement of this separation is in order.

I have added the "do not merge" label while I go over it again. Here are the two points I take from your remarks.

  • I would still keep the codec unaware of logic such as "these filters are too many, let's split" (this does not belong there)
  • but I agree the codec should expose a more general method, which just translates "any query"

More on the second point: "any query", in LC world -- where Document objects have just: (1) id, (2) a vector, (3) metadata kv pairs and (4) an unindexed text -- means that the most general thing to query is:

  • zero, one or more IDs (if more: implied OR)
  • a search vector with its k for ANN
  • metadata conditions. These are: k: [v1, v2...] ==> implied OR, but also {k1:v1, k2:v2...} where it can be AND/OR equally. We could assume AND between different keys (do we worry about loss of generality on this?) or we can keep a Data-API-like syntax here.

What I'm trying to get to is, since the codec's "query translator" makes an abstract query into a payload good to go to Data API, its input could be required to be not a dictionary (which arguably makes usage more error-prone), rather a specific structure: instances of some AstraDBDocumentQuery class, to be then made into dictionaries only with the knowledge of the encoding scheme.

I can probably find some time to rework this PR in this sense tomorrow - would that capture the essence of your remarks? (plus helping with clarity since it requires data classes to express the abstractness of queries? Or perhaps too unwieldy?)

@bjchambers

I like that direction and the observation it doesn't all need to be in a dict. That will keep the difference between id and metadata["id"] clear too.

I wonder if a data class is necessary. Could it just be:

def encode_query(self, ids: Iterable[str | int] = (), metadata: dict[str, Any] = {}):

I think keeping vector separate (it goes to sort, not the filter) could be reasonable.
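
A minimal sketch of what "vector separate from the filter" could look like when assembling the arguments of a find call: the encoded filter goes under "filter", while the query vector goes under "sort" as {"$vector": ...} together with a limit. The helper name build_find_arguments and the default k are hypothetical, and the keyword names are assumed to mirror a Data-API-style find.

```python
from typing import Any, Optional, Sequence


def build_find_arguments(
    encoded_filter: dict[str, Any],
    vector: Optional[Sequence[float]] = None,
    k: int = 4,
) -> dict[str, Any]:
    """Assemble find arguments, keeping the ANN vector out of the filter (sketch)."""
    arguments: dict[str, Any] = {"filter": encoded_filter}
    if vector is not None:
        # The vector drives the ordering (ANN search), it is not a filter condition.
        arguments["sort"] = {"$vector": list(vector)}
        arguments["limit"] = k
    return arguments
```

The codec then only needs to know how to encode ids and metadata; whether a vector is involved is decided by the caller.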

@hemidactylus
Collaborator Author

hemidactylus commented Feb 17, 2025

Right, and the metadata would be allowed to be anything (i.e. nested AND, OR, whatever). The current mechanism to rewrite with prefix unless it's a $-operator would be enough. It is not a codec's responsibility to split filters if they're too bulky.

Also agree that at this point the data class becomes useless weight. You convinced me: no data class :)

@hemidactylus
Collaborator Author

I have replaced "encode_id[s]" with an encode_query along the lines we discussed.

Some notes:

  • encode_query assumes the IDs are in AND with the metadata conditions (if both are passed). (I think it is unlikely that one wants an OR between those - in which case, running multiple queries and merging the results would be the way to go, I believe)
  • the second parameter is named filter_dict for compatibility with the name throughout the VectorStore class, where it always means "metadata filters"
  • ids need not be typed as str | int. In LangChain, IDs are always strings, I believe

I have adapted the unit tests (test_vs_doc_codecs.py, tests test_flat/default_query_encoding). Below is a summary of the wonders of encode_query for your convenience:

from langchain_astradb.utils.vector_store_codecs import _DefaultVSDocumentCodec, _FlatVSDocumentCodec
d = _DefaultVSDocumentCodec(content_field='c', ignore_invalid_documents=False)
f = _FlatVSDocumentCodec(content_field='c', ignore_invalid_documents=False)


d.encode_query()
f.encode_query()
# both: {}


d.encode_query(ids=['id1'])
f.encode_query(ids=['id1'])
# both: {'_id': 'id1'}


d.encode_query(ids=['id1', 'id2'])
f.encode_query(ids=['id1', 'id2'])
# both: {'_id': {'$in': ['id1', 'id2']}}


d.encode_query(ids=['d'],filter_dict={'x':'y'})
f.encode_query(ids=['d'],filter_dict={'x':'y'})
# resp.:
#   {'$and': [{'_id': 'd'}, {'metadata.x': 'y'}]}
#   {'$and': [{'_id': 'd'}, {'x': 'y'}]}


d.encode_query(ids=['d'],filter_dict={'x':'y','z':'w'})
f.encode_query(ids=['d'],filter_dict={'x':'y','z':'w'})
# resp.:
#   {'$and': [{'_id': 'd'}, {'metadata.x': 'y', 'metadata.z': 'w'}]}
#   {'$and': [{'_id': 'd'}, {'x': 'y', 'z': 'w'}]}


d.encode_query(ids=['d'],filter_dict={'$or':[{'x':'y'},{'z':'w'}]})
f.encode_query(ids=['d'],filter_dict={'$or':[{'x':'y'},{'z':'w'}]})
# resp.:
#   {'$and': [{'_id': 'd'}, {'$or': [{'metadata.x': 'y'}, {'metadata.z': 'w'}]}]}
#   {'$and': [{'_id': 'd'}, {'$or': [{'x': 'y'}, {'z': 'w'}]}]}
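
For readers without the code base at hand, the behaviour summarized above can be reproduced with a small standalone sketch. This is not the PR's implementation: encode_query_sketch, _prefix_keys and the metadata_prefix parameter are hypothetical names, with metadata_prefix="metadata." standing in for the default codec and metadata_prefix="" for the flat one. It also assumes that $-keys at the filter level are logical operators over a list of sub-filters.

```python
from typing import Any, Iterable, Optional


def _prefix_keys(filter_dict: dict[str, Any], prefix: str) -> dict[str, Any]:
    """Prefix metadata keys, leaving $-operators in place and recursing into them."""
    out: dict[str, Any] = {}
    for key, value in filter_dict.items():
        if key.startswith("$"):
            # Assumed: a logical operator ($and, $or) whose value is a list of filters.
            out[key] = [_prefix_keys(sub, prefix) for sub in value]
        else:
            out[prefix + key] = value
    return out


def encode_query_sketch(
    *,
    ids: Optional[Iterable[str]] = None,
    filter_dict: Optional[dict[str, Any]] = None,
    metadata_prefix: str = "metadata.",
) -> dict[str, Any]:
    """Combine an ID clause and a metadata clause, with an implicit $and between them."""
    clauses: list[dict[str, Any]] = []
    id_list = list(ids or [])
    if len(id_list) == 1:
        clauses.append({"_id": id_list[0]})
    elif len(id_list) > 1:
        # Several IDs mean "any of these".
        clauses.append({"_id": {"$in": id_list}})
    if filter_dict:
        clauses.append(_prefix_keys(filter_dict, metadata_prefix))
    if len(clauses) > 1:
        return {"$and": clauses}
    if clauses:
        return clauses[0]
    return {}
```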

@hemidactylus
Collaborator Author

the difference between id and metadata["id"]

(an incidental note is that trying to use metadata._id as a literal metadata field may stop working as soon as one has a "flat" vector store - for reasons unrelated to this PR, more fundamental. I believe however that the code should not throw an error - suppose there is a legacy vectorstore out there that uses such a metadata field, created prior to the introduction of flat codecs and autodetect. Certainly not something to encourage, ...)

*,
ids: Iterable[str] | None = None,
filter_dict: dict[str, Any] | None = None,
) -> dict[str, Any]:

May merit pydoc indicating the implicit $and.


if clauses:
    if len(clauses) > 1:
        return {"$and": clauses}

Agreed this makes sense. In general, my rationale is:

  1. If you want ids OR filter, then run separate queries -- the IDs only query is likely to be fast and the filter-only query is likely to do a scan.
  2. If you want ids AND filter then there is no option beyond running them together (unless you somehow emulate the full filtering semantics on the client side).

So, it seems like the AND is the only reasonable choice.
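
For the "ids OR filter" case in point 1, a minimal sketch of running two queries and merging the results on the client side could look as follows. The find argument is a hypothetical callable that takes an encoded filter and returns stored documents; it stands in for whatever collection call is actually used.

```python
from typing import Any, Callable, Iterable

# Hypothetical shape of a "run this encoded filter" function.
FindFn = Callable[[dict[str, Any]], Iterable[dict[str, Any]]]


def find_ids_or_filter(
    find: FindFn,
    id_query: dict[str, Any],
    filter_query: dict[str, Any],
) -> list[dict[str, Any]]:
    """Emulate "ids OR filter" with two queries, de-duplicating by _id (sketch)."""
    merged: dict[Any, dict[str, Any]] = {}
    # The ID-only query runs first: it is expected to be the cheap one.
    for query in (id_query, filter_query):
        for document in find(query):
            # Keep the first occurrence of each document.
            merged.setdefault(document["_id"], document)
    return list(merged.values())
```

Each of the two queries can be produced by encode_query with only ids or only filter_dict, so the codec itself never needs an OR mode.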

@hemidactylus hemidactylus removed the do_not_merge label ("Do not merge yet, requires further discussion") Feb 18, 2025
@hemidactylus hemidactylus merged commit 3715b62 into main Feb 18, 2025
13 checks passed
@hemidactylus hemidactylus deleted the SL-vs-full-delegate-to-codec branch February 18, 2025 22:12

Development

Successfully merging this pull request may close these issues.

Inconsistency between codec and metadata_indexing

3 participants